This article supplement, intended as a pedagogical tool, provides all of the code necessary to reproduce the case study illustration in Jumpstarting the Justice Disciplines: Computational Methods for Qualitative Research in Criminology and Criminal Justice Studies (work in progress). To boost the pedagogical value of this resource, we have provided detailed explanations and commentaries throughout each step.
To reproduce the code in this supplement, readers will need at least some background in the R programming language. There are many excellent resources available to learn the basics of R (and the RStudio integrated development environment or IDE, which we recommend). While certainly not an exhaustive list, below are some of our favourite free resources for learning the R/RStudio essentials you’ll need to follow along with this supplement.
Although not always free, there are also many courses available online through websites like Codecademy, Coursera, edX, Udemy, and DataCamp.
The rest of this supplement follows the six stages of the framework for using computational methods in qualitative research developed in the article. These stages are: (1) defining the problem; (2) collecting; (3) parsing, exploring, and cleaning; (4) sampling and outputting; (5) analyzing; and (6) findings and discussion. As we explain below, the bulk of material in this supplement is focused on steps 2, 3, and 4, as these are the steps that involved R programming. More in-depth discussions of the remaining steps 1, 5, and 6 can be found in the article.
The first step in designing a project that incorporates computational methods – as with any research project – is to determine a research question. For the sake of brevity, here we only restate the two overarching research questions that guided our collection and analysis of RCMP news releases. We asked:
How do the Royal Canadian Mounted Police (RCMP) visually represent their policing work in Canada? More specifically, what ‘work’ do the images included in RCMP press releases do with respect to conveying a message about policing and social control?
A more detailed discussion of this first step, including the literature review, can be found in the article. The remainder of this supplement will focus on the steps that involved R programming: collecting (step 2); parsing, exploring, and cleaning (step 3); and sampling and outputting (step 4). To analyze our data (step 5), we took a qualitative approach and relied on the qualitative research software NVivo (which will no doubt be familiar to most, if not all, qualitative researchers). The bulk of the final phase, findings and discussion (step 6), can also be found in the article and is not reproduced at length in this supplement. We do, however, provide several example images from our analysis in the supplement.
The first major step in conducting a web scrape is page exploration/inspection. What the researcher does at this stage is explore the content and structure of the pages they are interested in. The goal is to, first, find the various page elements that one wishes to collect. In this case, as we explain in the article, we are interested in collecting specific data points from thousands of RCMP news releases, including the title of each news release, date, location of RCMP detachment, the main text of the news release, and links to any images contained in the news release.
The second goal is to come up with an algorithmic solution or strategy for collecting this information. A key part of this second step is to carefully examine the source code of the website to determine what tools or libraries will be necessary to execute the scrape. While simpler websites can be scraped using an R package like library(rvest), more sophisticated websites may require that the researcher use an additional set of tools such as library(RSelenium).
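As a quick illustration of the distinction, a page whose data appears directly in its static HTML can be handled by library(rvest) alone. The sketch below is our own, using a hypothetical inline HTML fragment in place of a live page; the class names are invented for illustration.

```r
library(rvest)

# a stand-in for a simple, static page (hypothetical markup, not the RCMP site)
static_page <- read_html('<div class="list-group"><a href="/en/news/example-release">Example news release</a></div>')

# if the data you want can be pulled out like this, rvest alone will do;
# if the page builds its content with JavaScript after loading, these nodes
# will be missing from the raw html and RSelenium (which drives a real
# browser) may be needed instead
headline <- static_page %>%
  html_node('.list-group a') %>%
  html_text(trim = TRUE)
```

Running this kind of quick probe against a page of interest is an easy way to tell early on which set of tools a scrape will require.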
Another key part of constructing an algorithmic solution is to determine whether the information can (or should) be collected in one or multiple stages, where each stage represents a different script. We typically conduct our web scrapes in two stages: the index scrape and the contents scrape.
The index scrape works by automatically “clicking” through the page elements containing links to each of the sources one wants to obtain (in this case, RCMP news releases), extracting link information for each individual source, as well as any other metadata that may be available. In this first scrape, the primary goal is to obtain each of these links, building an index. Next, we write and deploy the script for the contents scrape, which visits each of the links in the index and obtains the desired data points.
The index and contents scrape can be thought of like a Google search. When searching the word “crime” on Google, one arrives first at a page containing various links to other websites. This first page (and subsequent pages) can be thought of as comprising the “index”: it contains the links to the pages we may be interested in visiting and consuming information from. Clicking on any given link in a Google result brings us to the website itself. The material on this website can be thought of as the contents, which would be obtained in the second (aka, contents) scrape.
The first step we took in our index scrape was to write a file (specifically, a comma-separated values or CSV file) to our local drive that we could use to store the results of our scrape. Another means of achieving the same result would be to store the results in RStudio’s global environment and write them to your local drive after the scrape completes. Two downsides to this second approach are that you cannot view the results until the scrape is completed, and that if your scrape fails at some point (which it very likely will, especially on more time-intensive tasks), you’ll lose the results you had obtained up until that point.
So, using this first approach, we’ll begin by creating a CSV spreadsheet that contains named columns for the data we’ll be collecting in our index scrape (headline_url, headline_text, etc.). To do this, we’ll use three tidyverse libraries: library(tibble), library(readr), and library(tidyr). (Remember that for this and subsequent steps, you’ll need to install the libraries before loading them, unless you have them installed already. In RStudio, libraries only need to be installed once, but will need to be loaded each time you launch a new session.)
# load the libraries we'll be using (install first if necessary)
library(tibble)
library(readr)
library(tidyr)

# give the file you'll be creating a name
filename <- "rcmp-news-index-scrape.csv"

# using the tibble function, create a dataframe with column headers
create_data <- function(
  headline_url = NA,
  headline_text = NA,
  date_published = NA,
  metadata_text = NA,
  page_url = NA
) {
  tibble(
    headline_url = headline_url,
    headline_text = headline_text,
    date_published = date_published,
    metadata_text = metadata_text,
    page_url = page_url
  )
}

# write the tibble to csv; drop_na() removes the all-NA row, leaving a
# file that contains only the column headers
write_csv(create_data() %>% drop_na(), filename, append = TRUE, col_names = TRUE)
Next, we’ll write the script for our index scraping algorithm, which will gather the data from the RCMP’s website and populate the CSV file we created in the last chunk of code. (Assuming the last chunk of code ran successfully, you should have a CSV file titled “rcmp-news-index-scrape.csv” in your working directory.) To conduct our index scrape, we’ll need to install/load an additional library – library(rvest) – that will be used to get and parse the information we want from the news release section of the RCMP’s website. From library(rvest), we will be using six functions: read_html(), html_node(), html_nodes(), html_attr(), html_text(), and url_absolute().
To locate the information we want, which is embedded in the RCMP website’s HyperText Markup Language (HTML) code, we’ll specify the HTML element that contains each data point (headline_url, headline_text, date_published, metadata_text, and page_url). As we’ve written about elsewhere, obtaining these elements is more an art than a science. The developer tools built into every modern browser can help with this task. Another popular tool is Andrew Cantino and Kyle Maxwell’s incredibly efficient and user-friendly Chrome browser extension SelectorGadget.
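To make the idea of HTML elements and CSS selectors concrete, here is a small, hypothetical fragment of our own that mirrors the structure of a news index entry, along with the selectors that would pull out each piece. The markup is invented for illustration; the selectors in the next chunk target the RCMP page’s actual structure.

```r
library(rvest)

# hypothetical markup mirroring the shape of a news index entry
fragment <- read_html('<div class="list-group"><div><div><a href="/en/news/sample">Sample headline</a><span class="text-muted"><meta itemprop="datePublished" content="2021-01-15">January 15, 2021</span></div></div></div>')

# "div > div > a" reads: an <a> that is a direct child of a <div>,
# which is itself a direct child of another <div>
link_text <- fragment %>%
  html_node('div > div > a') %>%
  html_text(trim = TRUE)

# attribute selectors like meta[itemprop=datePublished] match elements by
# attribute value; html_attr() then extracts the attribute's contents
pub_date <- fragment %>%
  html_node('meta[itemprop=datePublished]') %>%
  html_attr('content')
```

Tools like SelectorGadget generate selectors of exactly this kind by letting you click on the page elements you want.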
It is vitally important when web scraping to always insert a pause into the code, typically a minimum of 3 seconds, which can be achieved using the base R function Sys.sleep(). Pausing the loop after each execution (since there are 13,637 URLs, it will be executed 13,637 times) prevents the web scrape from placing undue stress on a website server.
# load rvest for getting and parsing the html
library(rvest)

base_url <- 'https://www.rcmp-grc.gc.ca/en/news?page='

scrape_page <- function(page_num = 0) {
  # grab html only once
  page_url <- paste(base_url, page_num, sep = '')
  curr_page <- read_html(page_url)
  # zero in on news list
  news_list <- curr_page %>%
    html_node('.list-group')
  # grab headline nodes
  headline_nodes <- news_list %>%
    html_nodes('div > div > a')
  # use headline nodes to get urls
  headline_url <- headline_nodes %>%
    html_attr('href') %>%
    url_absolute('https://www.rcmp-grc.gc.ca/en/news')
  # use headline nodes to get text
  headline_text <- headline_nodes %>%
    html_text(trim = TRUE)
  # grab metadata field
  metadata <- news_list %>%
    html_nodes('div > div > span.text-muted')
  # use metadata field to grab pubdate
  date_published <- metadata %>%
    html_nodes('meta[itemprop=datePublished]') %>%
    html_attr('content')
  # use metadata field to grab metadata text
  metadata_text <- metadata %>%
    html_text(trim = TRUE)
  # build a tibble
  page_data <- create_data(
    headline_url = headline_url,
    headline_text = headline_text,
    date_published = date_published,
    metadata_text = metadata_text,
    page_url = page_url
  )
  # write to csv
  write_csv(page_data, filename, append = TRUE)
  # find the number of the last page from the pagination widget;
  # subtract 1 because these pages are zero-indexed
  max_page_num <- curr_page %>%
    html_node('div.contextual-links-region ul.pagination li:nth-last-child(2)') %>%
    html_text(trim = TRUE) %>%
    as.numeric() %>%
    { . - 1 }
  # pause for a random interval of 3 to 10 seconds
  Sys.sleep(sample(seq(3, 10, by = 1), 1))
  # recur until we reach the last page
  if ((page_num + 1) <= max_page_num) {
    scrape_page(page_num = page_num + 1)
  }
}

# run it once
scrape_page()
Once our index scrape is complete, we can (must!) inspect the results before proceeding any further. To do this, we’ll read our CSV file into R using library(readr)’s read_csv() function. To print and inspect the results, we’ll use the paged_table() function from library(rmarkdown).
# load rmarkdown for the paged_table function
library(rmarkdown)

index <- read_csv("rcmp-news-index-scrape.csv")
paged_table(index)
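Beyond eyeballing the table, a few programmatic checks can catch problems early. The helper below is our own addition, not part of the original pipeline: it assumes the column names from our index scrape and uses library(dplyr) to count rows, unique URLs, and missing dates.

```r
library(dplyr)

# our own sanity-check helper, assuming the index scrape's column names
check_index <- function(index) {
  index %>%
    summarise(
      n_rows        = n(),
      n_unique_urls = n_distinct(headline_url),
      missing_dates = sum(is.na(date_published))
    )
}

# usage, after reading in the csv as above:
# check_index(index)
# if a failed scrape was restarted part-way, exact duplicate rows may
# appear; distinct() removes them:
# index <- index %>% distinct()
```

If the number of unique URLs is lower than the row count, deduplicating before the contents scrape saves needless requests to the server.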
Using the contents of the headline_url column, which contains the unique URLs for each of the 13,637 news releases published on the RCMP’s website, we can conduct our contents scrape. What we’ll be doing in the contents scrape is visiting each of the 13,637 links in the headline_url column and obtaining further information from each page, in particular the full text of the article and any images contained within it. To keep things simple, we are only going to take the first image from each page in the event that a page contains more than one. (To see an example of one of the pages we’ll be scraping, copy and paste any of the URLs in the headline_url column into your browser.)
Like we did above, we’ll begin by writing a CSV file to our local drive with named columns that correspond to the information we’re going to collect. We’re going to grab two data points from each of the links: the full text of the article (we’ll name this variable full_text in the CSV file) and any images contained in the news release, if there are any (image_url). (Note that we are not grabbing the image itself at this point, but rather are scraping the URL for the image, which we will use in subsequent steps to download each image.) Additionally, we’re going to save the page url (headline_url) for each article. Although this information is redundant, as we already collected it in the index scrape, we’ll need it to merge the results of our index and contents scrape.
filename <- 'rcmp-news-contents-scrape.csv'

create_data <- function(
  headline_url = NA,
  full_text = NA,
  image_url = NA
) {
  tibble(
    headline_url = headline_url,
    full_text = full_text,
    image_url = image_url
  )
}

# write once to create headers
write_csv(create_data() %>% drop_na(), filename, append = TRUE, col_names = TRUE)
And now we can write the code for our contents scrape. We’ll use the lapply() function this time, which will apply our script to each element in a list (in this case, each of the 13,637 URLs in the headline_url column of our “rcmp-news-index-scrape.csv” file).
index_list <- as.list(index$headline_url)

lapply(index_list, function(i) {
  webpage <- read_html(i)
  full_text <- html_node(webpage, ".node-news-release > div") %>% html_text(trim = TRUE)
  # default to NA so that pages without an image (where the assignment
  # inside try() fails or returns nothing) are handled safely
  image_url <- NA
  try(image_url <- html_node(webpage, ".img-responsive") %>% html_attr("src"))
  if (!is.na(image_url)) {
    # convert the relative image path to an absolute url
    image_url <- image_url %>% url_absolute(i)
  }
  page_data <- create_data(
    headline_url = i,
    full_text = full_text,
    image_url = image_url
  )
  write_csv(page_data, filename, append = TRUE)
  # pause to avoid placing undue stress on the server
  Sys.sleep(3)
})
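As noted earlier, long-running scrapes often fail part-way through, and one advantage of appending results to a CSV as we go is that a run can be resumed rather than restarted. The sketch below is our own addition: a small helper that compares the full set of URLs against those already saved, so that the lapply() call above can be re-run on only the remainder.

```r
# our own addition: return the urls that still need to be visited,
# given the full set and those already saved to the contents csv
urls_remaining <- function(all_urls, done_urls) {
  setdiff(all_urls, done_urls)
}

# usage, after a failed or interrupted run:
# done <- read_csv("rcmp-news-contents-scrape.csv")
# index_list <- as.list(urls_remaining(index$headline_url, done$headline_url))
# ...then re-run the lapply() call above on the shortened index_list
```

Because each iteration appends its row immediately, nothing collected before the failure is lost, and no page is requested twice.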
Finally, let’s combine the results of our index and contents scrapes into a single dataframe. We’ll save the combined results as a CSV file in our working directory.
# load dplyr for the join
library(dplyr)

# read in the two files
index_scrape <- read_csv("rcmp-news-index-scrape.csv")
contents_scrape <- read_csv("rcmp-news-contents-scrape.csv")

# combine the files by matching rows on the headline_url column
combined_df <- index_scrape %>% left_join(contents_scrape, by = "headline_url")

# save results
write_csv(combined_df, "rcmp-news-df.csv")
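Before moving on, it is worth confirming that the merge behaved as expected. The helper below is our own addition: it assumes the column names used throughout this supplement and simply checks that no index rows were lost and that columns from both files made it into the combined dataframe.

```r
# our own addition: quick checks that the merge behaved as expected
check_combined <- function(combined_df, index_scrape) {
  # every index row should appear exactly once in the combined data
  stopifnot(nrow(combined_df) == nrow(index_scrape))
  # columns from both the index and contents scrapes should be present
  stopifnot(all(c("headline_text", "full_text", "image_url") %in% names(combined_df)))
  invisible(TRUE)
}

# usage:
# check_combined(combined_df, index_scrape)
```

If the row counts differ, the usual culprit is duplicate or missing URLs in one of the two files, which is worth resolving before analysis.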
As we’ve been doing throughout, we’ll start by reading in and inspecting the data. We’ll look at just the first 10 rows from each variable.
# you may want to clear your global environment at this point:
# rm(list = ls())
rcmp_news <- read_csv("rcmp-news-df.csv")
paged_table(rcmp_news, options = list(rows.print = 10))